Probability inequalities are given for the deviation of the resubstitution error estimate from the unknown conditional probability of error. The inequalities are distribution-free and can be applied to linear discrimination rules, to nearest neighbor rules with a reduced sample size, and to histogram rules.

1.  Introduction 

The discrimination problem may be formulated as follows. The statistician collects data $(X_1,\theta_1),\ldots,(X_n,\theta_n)$, a sequence of independent identically distributed random vectors drawn from the distribution of $(X,\theta)$, a random vector independent of the data. For each $1 \le i \le n$, the observation $X_i$ takes values in $R^m$ and its state $\theta_i$ takes values in $\{1,\ldots,M\}$. The discrimination problem is that of estimating the state $\theta$ from the data and the observation $X$ using procedures which do not require complete knowledge of the distribution of $(X,\theta)$. If $\hat\theta$ denotes the estimate, that is, $\hat\theta = g(X,V_n)$ where $g$ is a Borel measurable $\{1,\ldots,M\}$-valued function of $X$ and the data $V_n = (X_1,\theta_1,\ldots,X_n,\theta_n)$, then a measure of the performance of the procedure given the data is $L_n = P\{\hat\theta \ne \theta \mid V_n\}$, the conditional probability of error.


Since the distribution of $(X,\theta)$ is unknown, there is in general no way of computing $L_n$ from the data. Using the data the statistician may try to estimate $L_n$ by $\hat L_n$. A survey of estimation techniques can be found in Toussaint. One of the oldest estimates is the resubstitution estimate

$$\hat L_n = \frac{1}{n}\sum_{i=1}^{n} I_{\{\hat\theta_i \ne \theta_i\}}$$

where $\hat\theta_i = g(X_i,V_n)$, $1 \le i \le n$, are the estimates of the states of $X_1,\ldots,X_n$ with the given discrimination procedure, and where $I$ is the indicator function.

In this paper we obtain upper bounds for $P\{|\hat L_n - L_n| \ge \epsilon\}$ that do not depend upon the distribution of $(X,\theta)$, and that are applicable to three large classes of discrimination rules:

(i) the linear discrimination rules,

(ii) the nearest-neighbor rules with reduced sample size, and

(iii) the histogram decision rules.

The existence of distribution-free bounds with the resubstitution estimate for linear discrimination rules was first noticed by Vapnik and Chervonenkis. The bounds for the class (i) improve the bounds given in Devroye and Wagner, while the results for the rules (ii) and (iii) are new. The possible existence of distribution-free bounds for (ii) was suggested to the authors by Dr. Penrod.

* This work was supported in part by the Air Force under Grant AFOSR 72-2371.

2. Main Results

Let $\phi_1,\ldots,\phi_{m'}$ be known measurable mappings from $R^m$ to $R$ where $m' \ge 1$ and $m \ge 1$, and let $\phi_0 \equiv 1$. Let $W_1 = (w_{10},\ldots,w_{1m'}),\ldots,W_M = (w_{M0},\ldots,w_{Mm'})$ be Borel-measurable $R^{m'+1}$-valued vector functions of the data $V_n$. Then, the rule which assigns the state $\hat\theta = j$ $(1 \le j \le M)$ to $X$ whenever $j$ is the first integer for which

$$\sum_{i=0}^{m'} w_{ji}\,\phi_i(X) = \max_{1\le k\le M} \sum_{i=0}^{m'} w_{ki}\,\phi_i(X)$$

is called a linear discrimination rule (see Duda and Hart for a survey of the literature on linear discrimination). We emphasize that the $W_1,\ldots,W_M$ may be picked in an arbitrary fashion, using any method, whether or not it can be found in the literature. The functions $\phi_1,\ldots,\phi_{m'}$ are picked in advance. The following bound is proved in the Appendix.

Theorem 1. For every $\epsilon > 0$ and for all linear discrimination rules with given $\phi_1,\ldots,\phi_{m'}$, the resubstitution estimate $\hat L_n$ satisfies

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le 4M\left(1+(2n)^{m'}\right)^{M-1} e^{-n\epsilon^2/8M^2}.$$

For the interesting case that $M=2$, we see that

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le 8\left(1+(2n)^{m'}\right) e^{-n\epsilon^2/32}.$$
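The bound of Theorem 1, read here as $4M(1+(2n)^{m'})^{M-1}e^{-n\epsilon^2/8M^2}$, can be evaluated numerically to see how large $n$ must be before it says anything useful. The following is only a sketch: the function names are hypothetical, the computation is done in log space to avoid overflow, and the search for the smallest $n$ relies on the assumption that the log of the bound is concave in $n$ for $n \ge 1$.

```python
import math

def log_theorem1_bound(n, eps, m_prime, M):
    """Natural log of 4M (1 + (2n)^m')^(M-1) exp(-n eps^2 / (8 M^2))."""
    a = m_prime * math.log(2 * n)                 # log of (2n)^m'
    log_one_plus = a + math.log1p(math.exp(-a))   # log(1 + (2n)^m'), overflow-safe
    return math.log(4 * M) + (M - 1) * log_one_plus - n * eps ** 2 / (8 * M ** 2)

def theorem1_bound(n, eps, m_prime, M):
    """The bound itself, capped at 1 because it bounds a probability."""
    return math.exp(min(0.0, log_theorem1_bound(n, eps, m_prime, M)))

def smallest_n(eps, m_prime, M, delta):
    """Smallest n whose bound is at most delta (0 < delta < 1)."""
    target = math.log(delta)
    hi = 1
    while log_theorem1_bound(hi, eps, m_prime, M) > target:
        hi *= 2                                   # double until the bound drops below delta
    lo = hi // 2                                  # bound still exceeds delta here (or lo == 0)
    while hi - lo > 1:                            # bisection on the integer interval (lo, hi]
        mid = (lo + hi) // 2
        if log_theorem1_bound(mid, eps, m_prime, M) <= target:
            hi = mid
        else:
            lo = mid
    return hi

if __name__ == "__main__":
    n_star = smallest_n(eps=0.1, m_prime=2, M=2, delta=0.05)
    print(n_star, theorem1_bound(n_star, 0.1, 2, 2))
```

Because the exponent is only $n\epsilon^2/8M^2$, the sample sizes returned for small $\epsilon$ are large; the bound is distribution-free rather than tight.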


Using the Borel-Cantelli lemma and Theorem 1, we see that for a given $m'$ and $M$, and uniformly over all linear discrimination rules, $|\hat L_n - L_n| \to 0$ with probability one, a result due to Glick. Thus, the statistician could pick the $W_1,\ldots,W_M$ that minimize the resubstitution estimate $\hat L_n$ because he knows from Theorem 1 that the corresponding probability of error $L_n$ will be very close to $\hat L_n$, and that for large $n$, minimizing $\hat L_n$ is nearly equivalent to minimizing $L_n$ (see Wagner).
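To make the objects in this discussion concrete, here is a minimal sketch of a linear discrimination rule and its resubstitution estimate. The helper names, the toy data, and the fixed weight vectors are all illustrative assumptions; in the setting of the paper the weights would be chosen from the data, for instance by minimizing the estimate.

```python
def linear_rule(weights, features):
    """Build g(x) for a linear discrimination rule.

    weights[j-1] = (w_j0, ..., w_jm') for state j; features = [phi_1, ..., phi_m'].
    g(x) is the first j maximizing sum_i w_ji * phi_i(x), with phi_0 = 1.
    """
    def g(x):
        phi = [1.0] + [f(x) for f in features]
        scores = [sum(w_i * p for w_i, p in zip(w, phi)) for w in weights]
        return scores.index(max(scores)) + 1      # index() returns the first maximizer
    return g

def resubstitution_estimate(g, data):
    """Fraction of the data (x_i, theta_i) misclassified by g."""
    return sum(1 for x, theta in data if g(x) != theta) / len(data)

if __name__ == "__main__":
    # Toy one-dimensional data with two states (hypothetical values).
    data = [(0.0, 1), (0.5, 1), (1.0, 1), (1.5, 1), (2.0, 2), (3.0, 2)]
    # State 1 scores 0, state 2 scores x - 1, so state 2 is chosen iff x > 1.
    g = linear_rule(weights=[(0.0, 0.0), (-1.0, 1.0)], features=[lambda x: x])
    print(resubstitution_estimate(g, data))
```

The tie at $x = 1$ is resolved in favor of state 1, matching the "first integer" convention in the definition of the rule.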

In the literature special attention has been given to the nearest-neighbor rule with a reduced number of observations where the reduction is a result of editing (see for instance Wilson), condensing (Hart) or any other operation (Tomek). In general, we end up with $R^m \times \{1,\ldots,M\}$-valued random vectors $(Y_1,\xi_1),\ldots,(Y_K,\xi_K)$ where $K$ is an integer-valued random variable with $1 \le K \le k_n$. The $(Y_i,\xi_i)$ and $K$ may depend upon the data in an arbitrary fashion. A new observation $X$ is assigned the state $\hat\theta = \xi_j$ whenever $j$ is the smallest index for which

$$\|X - Y_j\| = \min_{1\le i\le K} \|X - Y_i\|.$$

Thus, $\hat\theta$ is the state of the nearest neighbor to $X$ among $Y_1,\ldots,Y_K$. In the Appendix the following theorem is proved.


Theorem 2. For every $\epsilon > 0$, and for all the nearest-neighbor rules with reduced sample size, the resubstitution estimate satisfies

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le 4M\left(1+(2n)^{m}\right)^{k_n-1} e^{-n\epsilon^2/8k_n^2}$$

where $k_n$ is an upper bound on the reduced sample size.


We remark that this bound is independent of the distribution of $(X,\theta)$. The bound converges to 0 as $n$ grows large provided that the sequence $k_1,k_2,\ldots$ is picked in such a way that $k_n^3 \log n / n \to 0$. It is clear that this bound is useless for the well-known nearest neighbor rule, that is, the rule with $K = k_n = n$ and $(Y_i,\xi_i) = (X_i,\theta_i)$, $1 \le i \le n$. This was to be expected because the resubstitution estimate with the nearest-neighbor rule is overly optimistic. In fact, if the probability measure $\mu$ of $X$ is absolutely continuous with respect to Lebesgue measure, then $\hat L_n = 0$ with probability one, no matter what value $L_n$ takes.
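The growth condition on $k_n$ can be checked directly from the bound of Theorem 2. The short derivation below is a sketch added for the reader, assuming the bound reads $4M(1+(2n)^m)^{k_n-1}e^{-n\epsilon^2/8k_n^2}$ and the condition reads $k_n^3\log n/n \to 0$; it uses only the elementary inequality $1+(2n)^m \le (4n)^m$ for $m \ge 1$, $n \ge 1$.

```latex
\begin{align*}
\log\Bigl(4M\bigl(1+(2n)^{m}\bigr)^{k_n-1} e^{-n\epsilon^2/8k_n^2}\Bigr)
  &\le \log(4M) + m\,k_n \log(4n) - \frac{n\epsilon^2}{8k_n^2} \\
  &= \log(4M) - \frac{n}{k_n^2}\left(\frac{\epsilon^2}{8}
       - \frac{m\,k_n^3 \log(4n)}{n}\right).
\end{align*}
% If k_n^3 log n / n -> 0, the term in parentheses tends to eps^2/8 > 0,
% and n / k_n^2 >= n / k_n^3 -> infinity, so the right-hand side tends to
% minus infinity and the bound tends to 0.
```

With $k_n = n$ the second term dominates for every $n$, which is one way to see why the bound carries no information for the ordinary nearest neighbor rule.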


Theorem 2 can be useful for reduced, selective, condensed or edited nearest-neighbor rules. If $k_n$ is a prespecified number of $(Y_i,\xi_i)$'s that are to be used in the new nearest-neighbor rule, then the statistician could compute $\hat L_n$ with some selected data $(X_{i_1},\theta_{i_1}),\ldots,(X_{i_{k_n}},\theta_{i_{k_n}})$ where $(i_1,\ldots,i_{k_n})$ is a subset of $(1,\ldots,n)$, and decide to use that set of indices for which the resubstitution estimate is minimal. Using Theorem 2, we also know how much confidence we can put in our estimate regardless of the selection procedure of the $(Y_i,\xi_i)$ and without any knowledge of the distribution of $(X,\theta)$.
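The selection procedure just described can be sketched as follows. This is an illustrative brute-force version under stated assumptions: the function names and the toy data are hypothetical, observations are tuples of floats, and the exhaustive search over index subsets is only feasible for very small $n$ and $k_n$; any other selection heuristic could be substituted without affecting the applicability of Theorem 2.

```python
import itertools
import math

def nn_rule(prototypes):
    """1-NN rule over prototypes [(y, label), ...]; ties go to the smallest index."""
    def g(x):
        dists = [math.dist(x, y) for y, _ in prototypes]
        return prototypes[dists.index(min(dists))][1]
    return g

def resubstitution_estimate(g, data):
    """Fraction of the full data set misclassified by g."""
    return sum(1 for x, theta in data if g(x) != theta) / len(data)

def best_subset(data, k):
    """Search all index subsets of size k for minimal resubstitution error."""
    best = None
    for idx in itertools.combinations(range(len(data)), k):
        err = resubstitution_estimate(nn_rule([data[i] for i in idx]), data)
        if best is None or err < best[0]:
            best = (err, idx)
    return best

if __name__ == "__main__":
    # Hypothetical one-dimensional data: two well-separated groups.
    data = [((0.0,), 1), ((0.2,), 1), ((0.4,), 1),
            ((1.0,), 2), ((1.2,), 2), ((1.4,), 2)]
    print(best_subset(data, k=2))
```

Note that the estimate is always computed on all $n$ observations, not only on the $k_n$ retained ones; this is the quantity that Theorem 2 controls.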


The $(Y_i,\xi_i)$, $1 \le i \le k_n$, partition $R^m$ into $k_n$ disjoint sets $A_1,\ldots,A_{k_n}$ where the state of $X$ is estimated by $\xi_j$ whenever $X$ takes values in $A_j$ (that is, $X$ is closest to $Y_j$). The partition in this case depends on the data because the $Y_i$ depend upon the data. For a given fixed partition of $R^m$, we can expect to obtain tighter upper bounds for $P\{|\hat L_n - L_n| \ge \epsilon\}$ even if the partition is not generated by a reduced nearest-neighbor rule.


Let $A_1,\ldots,A_{k_n}$ be any fixed partition of $R^m$ and let $\xi_1,\ldots,\xi_{k_n}$ be $\{1,\ldots,M\}$-valued random variables where, as before, $\xi_i$ is the state assigned to $X$ whenever $X$ takes values in $A_i$. Such rules will be called histogram decision rules. We prove the following four distribution-free inequalities that are valid no matter how the $\xi_i$ depend upon the data. The inequalities do not imply one another.


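Theorem 3. For a given $k_n$-member partition of $R^m$, for any way of specifying $\xi_1,\ldots,\xi_{k_n}$ from the data in a histogram decision rule, and for every $\epsilon > 0$, the resubstitution estimate $\hat L_n$ satisfies

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le g_{ni}, \qquad 1 \le i \le 4,$$

where

$$g_{n1} = 4k_n \min(1+2n,\,M)\, e^{-n\epsilon^2/8k_n^2},$$

$$g_{n2} = 2k_n M\, e^{-2n\epsilon^2/M^2k_n^2},$$

$$g_{n3} = 4 \min\left(M^{k_n},\, 2^{2n},\, (4n/k_n)^{k_n}\right) e^{-n\epsilon^2/8},$$

$$g_{n4} = 2M^{k_n}\, e^{-2n\epsilon^2}.$$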
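Since the four bounds of Theorem 3 trade off $k_n$, $M$ and $n$ differently, it can help to evaluate them side by side. The sketch below assumes the forms $g_{n1} = 4k_n\min(1+2n,M)e^{-n\epsilon^2/8k_n^2}$, $g_{n2} = 2k_nMe^{-2n\epsilon^2/M^2k_n^2}$, $g_{n3} = 4\min(M^{k_n},2^{2n},(4n/k_n)^{k_n})e^{-n\epsilon^2/8}$ and $g_{n4} = 2M^{k_n}e^{-2n\epsilon^2}$, takes $M$ finite, and works with logarithms so that terms like $M^{k_n}$ do not overflow. The function name is hypothetical.

```python
import math

def log_histogram_bounds(n, eps, k, M):
    """Natural logs of g_n1, g_n2, g_n3, g_n4 for a k-cell partition and M states."""
    g1 = math.log(4 * k) + math.log(min(1 + 2 * n, M)) - n * eps ** 2 / (8 * k ** 2)
    g2 = math.log(2 * k * M) - 2 * n * eps ** 2 / (M ** 2 * k ** 2)
    log_count = min(k * math.log(M),          # log M^k
                    2 * n * math.log(2),      # log 2^(2n)
                    k * math.log(4 * n / k))  # log (4n/k)^k
    g3 = math.log(4) + log_count - n * eps ** 2 / 8
    g4 = math.log(2) + k * math.log(M) - 2 * n * eps ** 2
    return [g1, g2, g3, g4]

if __name__ == "__main__":
    for params in [(100_000, 0.1, 10, 2), (1_000, 0.1, 100, 10 ** 6)]:
        logs = log_histogram_bounds(*params)
        best = min(range(4), key=lambda i: logs[i])
        print(params, "smallest bound: g_n%d" % (best + 1), [round(v, 2) for v in logs])
```

A positive logarithm means the corresponding bound exceeds 1 and is vacuous for those parameters; which bound is smallest changes with the regime, in line with the remark that the inequalities do not imply one another.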


We note here that $g_{n1}$ and $g_{n3}$ are useful even if $M = \infty$ (i.e., the $\xi_i$ and $\theta_i$ can take a countably infinite number of values). Clearly, all the $g_{ni}$ are independent of the dimension $m$ and the distribution of $(X,\theta)$, and $g_{n3} \to 0$ provided that $k_n/n \to 0$. If $k_n = \infty$


the bounds are not applicable, and, as we will see, the resubstitution estimate does not possess the distribution-free properties that it has with finite partitions of $R^m$. Assume that $A_1,A_2,\ldots$ is a fixed countably infinite partition of $R^m$. If the $\xi_i$ are random variables that are independent of the data, then

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le 2e^{-2n\epsilon^2} \qquad (1)$$

for any $\epsilon > 0$. However, such rules are impractical. The closest one can come to the Bayes rule with a fixed partition is to let $\xi_i = j$ if $j$ is the smallest integer such that $N_{ij} = \max_{1\le t\le M} N_{it}$, where $N_{ij}$ is the number of $(X_k,\theta_k)$'s with $X_k \in A_i$ and $\theta_k = j$. But even with this obvious choice of $\xi_i$, we see that for any $m$ and $M \ge 2$, there always exists a distribution of $(X,\theta)$ such that $|\hat L_n - L_n| \ge 1/2$ with probability one. Indeed, assume that $M=2$, that $\theta=2$ with probability one, and that $X$ takes values in each of $A_1,\ldots,A_{2n}$ with equal probability $1/2n$. If the $\xi_i$ are picked as described, then the resubstitution estimate equals 0. Furthermore, at least $n$ of the sets $A_1,\ldots,A_{2n}$ contain no observation and are therefore assigned the state 1, so that

$$L_n = \sum_{i=1}^{2n} P\{X \in A_i\}\, I_{\{\xi_i \ne 2\}} \ge n/2n = 1/2.$$


This shows that even with the most obvious dependency of the $\xi_i$ on the data, we will never be able to upper-bound $P\{|\hat L_n - L_n| \ge \epsilon\}$ by an expression that decreases to 0 as $n$ grows large, uniformly over all distributions of $(X,\theta)$.
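The counterexample can be reproduced in a few lines. In this sketch (function name and seed are illustrative assumptions) the state is always 2, each of $2n$ cells has probability $1/2n$, and every cell is labeled by the majority rule with ties, including empty cells, resolved toward the smallest state. The true error is computed exactly from the cell labels rather than estimated.

```python
import random

def histogram_counterexample(n, seed=0):
    """Return (resubstitution estimate, true conditional error L_n)."""
    rng = random.Random(seed)
    cells = [rng.randrange(2 * n) for _ in range(n)]   # cell index of each X_i
    thetas = [2] * n                                   # theta = 2 with probability one
    counts = {}                                        # counts[cell] = [N_i1, N_i2]
    for c, t in zip(cells, thetas):
        counts.setdefault(c, [0, 0])[t - 1] += 1

    def label(cell):
        n1, n2 = counts.get(cell, (0, 0))
        return 1 if n1 >= n2 else 2                    # smallest j attaining the maximum

    resub = sum(label(c) != t for c, t in zip(cells, thetas)) / n
    # Each cell carries probability 1/(2n) and the true state is always 2.
    true_error = sum(label(c) != 2 for c in range(2 * n)) / (2 * n)
    return resub, true_error

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, histogram_counterexample(n))
```

Whatever the seed, at most $n$ of the $2n$ cells can be occupied, so the second returned value is at least $1/2$ while the first is 0.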


3. Appendix

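Proof of Theorem 1.

Let $\nu$ be the probability measure of $(X,\theta)$ where $X$ takes values in $R^{m'}$ and $\theta$ takes values in $\{1,\ldots,M\}$. It is clear that if $\nu_n$ is the empirical measure for $(X_1,\theta_1),\ldots,(X_n,\theta_n)$, and if $A_1,\ldots,A_M$ is the partition of $R^{m'}$ that is generated by the linear discrimination rule (that is, $A_i$ is the set on which we estimate the state of $X$ by $i$), then

$$L_n = \sum_{i=1}^{M} \nu(A_i \times \{i\}^c)$$

and

$$\hat L_n = \sum_{i=1}^{M} \nu_n(A_i \times \{i\}^c).$$

Thus,

$$|\hat L_n - L_n| = \Bigl|\sum_{i=1}^{M} \bigl(\nu(A_i \times \{i\}^c) - \nu_n(A_i \times \{i\}^c)\bigr)\Bigr| = \Bigl|\sum_{i=1}^{M} \bigl(\nu_n(A_i \times \{i\}) - \nu(A_i \times \{i\})\bigr)\Bigr| \le M \sup_{A \in \mathcal{A},\; i \in \{1,\ldots,M\}} |\nu_n(A \times \{i\}) - \nu(A \times \{i\})|$$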

where $\mathcal{A}$ is the class of all sets that are intersections of $(M-1)$ linear halfspaces of $R^{m'}$. We recall that a linear halfspace of $R^{m'}$ is a set of $x = (x^1,\ldots,x^{m'})$ for which $x^1 a_1 + \cdots + x^{m'} a_{m'} \ge a_0$ or $x^1 a_1 + \cdots + x^{m'} a_{m'} < a_0$ for some $(a_0,a_1,\ldots,a_{m'}) \in R^{m'+1}$. Thus, every $(a_0,a_1,\ldots,a_{m'})$ defines two linear halfspaces.

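By an inequality of Vapnik and Chervonenkis,

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le P\Bigl\{\sup_{A \in \mathcal{A},\; 1 \le i \le M} |\nu_n(A \times \{i\}) - \nu(A \times \{i\})| \ge \epsilon/M\Bigr\} \le 4\, s(\mathcal{B},2n)\, e^{-n\epsilon^2/8M^2}$$

where $\mathcal{B} = \mathcal{A} \times \{\{1\},\ldots,\{M\}\}$ and $s(\mathcal{B},n)$ is the maximum over all $(x_1,y_1),\ldots,(x_n,y_n)$ in $R^{m'} \times \{1,\ldots,M\}$ of the number of different sets in $\{(\{(x_1,y_1)\} \cup \cdots \cup \{(x_n,y_n)\}) \cap B \mid B \in \mathcal{B}\}$. If $\mathcal{A}$ is the class of all linear halfspaces of $R^{m'}$ and $M=1$ (so that the states play no role), then $s(\mathcal{B},n) \le 1+n^{m'}$ by a theorem of Cover (see also Vapnik and Chervonenkis). It is clear that if $\mathcal{A}$ is the class of all intersections of $M-1$ or fewer linear halfspaces and the states are again ignored, then $s(\mathcal{B},n) \le (1+n^{m'})^{M-1}$. If $M>1$, then $s(\mathcal{B},n) \le M(1+n^{m'})^{M-1}$. Indeed, if $s_1$ is the number of different sets in $\{(\{x_1\} \cup \cdots \cup \{x_n\}) \cap A \mid A \in \mathcal{A}\}$, then the number of different sets in $\{(\{(x_1,y_1)\} \cup \cdots \cup \{(x_n,y_n)\}) \cap B \mid B \in \mathcal{A} \times \{\{1\},\ldots,\{M\}\}\}$ is at most $M s_1$. Thus we have shown that

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le 4M\left(1+(2n)^{m'}\right)^{M-1} e^{-n\epsilon^2/8M^2}.$$

Q.E.D.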

Proof of Theorem 2.

Let us use the notation of Theorem 1 where we let $A_1,\ldots,A_K$ be the partition of $R^m$ that is generated by the nearest-neighbor rule with $(Y_1,\xi_1),\ldots,(Y_K,\xi_K)$ (i.e., $A_i$ is the set on which we estimate the state of $X$ by $\xi_i$ and for which $Y_i$ is the nearest neighbor to $X$ among $Y_1,\ldots,Y_K$). Then

$$L_n = \sum_{i=1}^{K} \nu(A_i \times \{\xi_i\}^c)$$

and

$$\hat L_n = \sum_{i=1}^{K} \nu_n(A_i \times \{\xi_i\}^c).$$

Thus, arguing as in Theorem 1, we have

$$|\hat L_n - L_n| \le k_n \sup_{A \in \mathcal{A},\; 1 \le i \le M} |\nu_n(A \times \{i\}) - \nu(A \times \{i\})|$$

where $\mathcal{A}$ is the class of all sets that are intersections of $(k_n-1)$ or fewer linear halfspaces of $R^m$. Since $K \le k_n$, we have by the argument of Theorem 1 that

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le 4M\left(1+(2n)^{m}\right)^{k_n-1} e^{-n\epsilon^2/8k_n^2}.$$

Q.E.D.


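Proof of Theorem 3.

It is clear that

$$|\hat L_n - L_n| = \Bigl|\sum_{\ell=1}^{k_n} \nu(A_\ell \times \{\xi_\ell\}^c) - \sum_{\ell=1}^{k_n} \nu_n(A_\ell \times \{\xi_\ell\}^c)\Bigr| = \Bigl|\nu\Bigl(\bigcup_{\ell=1}^{k_n} (A_\ell \times \{\xi_\ell\})\Bigr) - \nu_n\Bigl(\bigcup_{\ell=1}^{k_n} (A_\ell \times \{\xi_\ell\})\Bigr)\Bigr| \le \sum_{\ell=1}^{k_n} \sup_{1 \le i \le M} |\nu(A_\ell \times \{i\}) - \nu_n(A_\ell \times \{i\})|.$$

Thus, if $\mathcal{C}_\ell$ is the class of sets of the form $A_\ell \times \{i\}$, $1 \le i \le M$, then we know by an inequality of Vapnik and Chervonenkis that

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le k_n \sup_{1 \le \ell \le k_n} P\Bigl\{\sup_{C \in \mathcal{C}_\ell} |\nu(C) - \nu_n(C)| \ge \epsilon/k_n\Bigr\} \le 4k_n \Bigl(\sup_{1 \le \ell \le k_n} s(\mathcal{C}_\ell,2n)\Bigr) e^{-n(\epsilon/k_n)^2/8} \le 4k_n \min(1+2n,\,M)\, e^{-n\epsilon^2/8k_n^2}.$$

Also,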
n 




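$$P\{|\hat L_n - L_n| \ge \epsilon\} \le P\Bigl\{\sum_{\ell=1}^{k_n} \sum_{i=1}^{M} |\nu(A_\ell \times \{i\}) - \nu_n(A_\ell \times \{i\})| \ge \epsilon\Bigr\} \le k_n M \sup_{1 \le i \le M,\; 1 \le \ell \le k_n} P\{|\nu(A_\ell \times \{i\}) - \nu_n(A_\ell \times \{i\})| \ge \epsilon/k_n M\} \le 2k_n M\, e^{-2n\epsilon^2/M^2k_n^2}$$

by an inequality of Hoeffding. Furthermore,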


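$$|\hat L_n - L_n| \le \sup \Bigl|\nu\Bigl(\bigcup_{\ell=1}^{k_n} (A_\ell \times \{\lambda_\ell\})\Bigr) - \nu_n\Bigl(\bigcup_{\ell=1}^{k_n} (A_\ell \times \{\lambda_\ell\})\Bigr)\Bigr|,$$

where the supremum is over all $(\lambda_1,\ldots,\lambda_{k_n})$ from $\{1,\ldots,M\}^{k_n}$, so that

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le 4\, s(\mathcal{B}^*,2n)\, e^{-n\epsilon^2/8}$$

where $\mathcal{B}^*$ is the class of all sets of the form $\bigcup_{\ell=1}^{k_n} (A_\ell \times \{\lambda_\ell\})$ where $(\lambda_1,\ldots,\lambda_{k_n}) \in \Theta = \{1,\ldots,M\}^{k_n}$. Clearly, $s(\mathcal{B}^*,2n) \le 2^{2n}$ for all $k_n$. However, if $k_n < 2n$, then $s(\mathcal{B}^*,2n) \le M^{k_n}$ and, in general, we must have that $s(\mathcal{B}^*,2n) \le 2^{k_n}(2n/k_n)^{k_n}$. This proves the inequality with $g_{n3}$.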

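Finally, notice that

$$P\{|\hat L_n - L_n| \ge \epsilon\} \le \sum_{\lambda \in \Theta} P\Bigl\{\Bigl|\nu\Bigl(\bigcup_{\ell=1}^{k_n} (A_\ell \times \{\lambda_\ell\})\Bigr) - \nu_n\Bigl(\bigcup_{\ell=1}^{k_n} (A_\ell \times \{\lambda_\ell\})\Bigr)\Bigr| \ge \epsilon\Bigr\} \le 2M^{k_n} e^{-2n\epsilon^2}.$$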


Proof of (1).

Inequality (1) is a corollary of Hoeffding's inequality if we note that $|\hat L_n - L_n| = |\nu(C) - \nu_n(C)|$